1 Introduction

This report is an exploratory data analysis (EDA) of ESG (Environmental, Social, and Governance) assessment data for a variety of companies. The analysis uses SQL queries to extract insights from two databases downloaded from Kaggle.com that contain company profiles, ESG risk ratings, controversy scores, and other company-specific metrics. The purpose of this report is to showcase my SQL skills in querying, analyzing, and visualizing data from multiple tables to derive meaningful insights.

Skills exhibited: Data cleaning, joins, aggregations, filtering, and visualization (plotly and ggplot).

2 Data Overview

The dataset comprises two main tables from Kaggle.com: the S&P 500 ESG Risk Ratings” dataset compiled by Pritish Dugar in 2023, and Public Company ESG Ratings Dataset dataset compiled by Alistair King in 2023. These datasets are publically available. The datasets contain the following information:

(1). Public Company ESG Ratings Dataset: (a). Basic company information: ticker symbol, company name, currency, exchange, industry, logo URL, website URL (b). Environmental scores and rating: environment_score, environment_grade, environment_level (c). Social scores and rating: social_score, social_grade, social_level (d). Governance scores and rating: governance_score, governance_grade, governance_level (e). Overall ESG scores and rating: total_score, total_grade, total_level (f). Last processing date of the ESG data (g). CIK identifier

(2). S&P 500 ESG Risk Ratings: Symbol, Name, Address, Sector, Industry, Full Time Employees, Description, Total ESG Risk Score, Environment Risk Score, Governance Risk Score, Social Risk Score, Controversy Level, Controversy Score, ESG Risk Percentile, ESG Risk Level

3 SQL Queries and Visualizations

3.0.1 Evaluate Similarities Between the Datasets

The first query evaluates the similarities between the two datasets by joining them on the company ticker symbol and comparing the industry fields. The query uses the ‘LIKE’ function to compare companies with similar industry labels in both datasets.

# Define the SQL query to join the two tables
query0 <- "
SELECT d1.name AS Name, d1.industry AS D1_Industry, d2.industry AS D2_Industry
FROM public_esg d1
INNER JOIN sp_esg d2
ON d2.symbol = UPPER(d1.ticker)
WHERE 
    d1.industry LIKE '%' || d2.industry || '%'
    OR d2.industry LIKE '%' || d1.industry || '%'
ORDER BY d1.industry
"

# Execute the query
eval_dataset <- sqldf(query0)

# Display the result in a datatable
datatable(eval_dataset, options = list(
  scrollX = TRUE,  
  pageLength = 5   
))

3.0.2 ESG Risk Scores by Company

The following query retrieves the ESG risk scores (total, environmental, social, and governance) for companies, sorted alphabetically by name. The results are displayed in an interactive plot using plotly.

query1<- 
"SELECT 
    d1.name AS company_name, d1.industry, total_esg_risk_score, environment_risk_score, governance_risk_score, social_risk_score
FROM 
    public_esg d1
INNER JOIN 
    sp_esg d2
ON 
    d2.symbol = UPPER(d1.ticker)
WHERE total_esg_risk_score IS NOT NULL
ORDER BY d1.name ASC"

# Execute the query
esg_risk_scores <- sqldf(query1)
# Display the result in a datatable
datatable(esg_risk_scores, options = list(
  scrollX = TRUE,  
  pageLength = 5   
))

3.0.3 Average ESG Risk by Industry

The following query calculates the average ESG risk score for industries with a low ESG risk level. The results are displayed in a bar plot using ggplot. I use the ‘ROUND’ and ‘GROUP BY’ functions to calculate the average risk score for each industry.

query2 <- 
"SELECT 
    d1.industry AS Industry, ROUND(avg(total_esg_risk_score), 2) AS Average_Industry_Risk
FROM 
    public_esg d1
INNER JOIN 
    sp_esg d2
ON 
    d2.symbol = UPPER(d1.ticker)
WHERE esg_risk_level = 'Low'
GROUP BY d1.industry
ORDER BY Average_Industry_Risk ASC"

# Execute the query
avg_industry <- sqldf(query2)

Findings: The plot shows the average ESG risk scores for industries with a low ESG risk level. The ‘Auto Components’ industry has the lowest average ESG risk score, while the ‘Machinery’ industry has the highest average ESG risk score.

3.0.4 Companies with High Controversy Scores

The following query identifies companies with controversy scores exceeding 3, sorted by controversy level. The results are displayed in a datatable using the DT package.

query3 <-
"SELECT 
    d1.name as Name, controversy_level AS Controversy_Level, controversy_score AS Controversy_Score, d1.industry as Industry
FROM 
    public_esg d1
INNER JOIN 
    sp_esg d2
ON 
    d2.symbol = UPPER(d1.ticker)
WHERE controversy_score > 3
ORDER BY controversy_level"

# Execute the query
high_contro <- sqldf(query3)

datatable(high_contro, options = list(
  scrollX = TRUE,  
  pageLength = 5   
))

Findings: Companies with high controversy scores come from a variety of industries, including automobiles, media, and banking. The ESG practices of these companies are controversial and should not be replicated in an ideal ESG framework.

3.0.5 Companies with Exceptional Governance Scores

The following query identifies companies with governance scores exceeding 300 (from the public_esg dataset) and governance risk scores (from the sp_esg dataset) below 5. The results are displayed in a datatable using the DT package.

query4 <-
"SELECT 
    d1.name As Name, governance_score As Governance_Score, governance_risk_score As Governance_Risk_Score
FROM 
    public_esg d1
INNER JOIN 
    sp_esg d2
ON 
    d2.symbol = UPPER(d1.ticker)
WHERE governance_score > 300 
    AND governance_risk_score < 5
ORDER BY governance_score DESC"

# Execute the query
high_govern <- sqldf(query4)

# Format the output using the datatable function
datatable(high_govern, options = list(
  scrollX = TRUE,  # horizontal scrolling
  pageLength = 5   # Set the number of rows per page
))

Findings: There is some discrepency between the governance scores and governance risk scores. Companies with high governance scores may not necessarily have low governance risk scores. This could be due to differences in the scoring methodologies used by the two datasets. However, these companies have effective governance practices that should be emulated by others.

3.0.6 Processing Dates for Large Companies with Over 75,000 Employees

# Remove commas from 'full_time_employees' and convert it to numeric
sp_esg$full_time_employees <- as.numeric(gsub(",", "", sp_esg$full_time_employees))

# Convert 'last_processing_date' to Date format 
public_esg$last_processing_date <- as.Date(public_esg$last_processing_date, format = "%d-%m-%Y")

# Run the query
query5 <- "
SELECT 
    d1.name AS Name, last_processing_date AS Processing_Date, full_time_employees AS Full_Time_Employees
FROM 
    public_esg d1
INNER JOIN 
    sp_esg d2
ON 
    d2.symbol = UPPER(d1.ticker)
WHERE full_time_employees > 75000
  AND last_processing_date = '19098' OR last_processing_date = '19099'
ORDER BY last_processing_date DESC
"

# Execute the query
large_companies <- sqldf(query5)

# Convert last-processing-date back into date format
large_companies$Processing_Date <- as.Date(large_companies$Processing_Date, origin = "1970-01-01")

# Format the output using the datatable function
datatable(large_companies, options = list(
  scrollX = TRUE,  # horizontal scrolling
  pageLength = 5   # Set the number of rows per page
))

3.0.7 Companies in High ESG Risk Percentiles

The following query categorizes companies into ESG risk percentiles (Top 10%, Top 25%, Top 50%, Bottom 50%) and ranks them. The results are displayed in a datatable using the DT package and an interactive plot using the plotly package.

query6 <-
"SELECT *
FROM (
    SELECT 
        name AS Name, 
        esg_risk_percentile AS ESG_Risk_Percentile,
        CASE
            WHEN CAST(SUBSTRING(esg_risk_percentile, 1, 2) AS INT) >= 90 THEN 'Top 10%'
            WHEN CAST(SUBSTRING(esg_risk_percentile, 1, 2) AS INT) BETWEEN 75 AND 89 THEN 'Top 25%'
            WHEN CAST(SUBSTRING(esg_risk_percentile, 1, 2) AS INT) BETWEEN 50 AND 74 THEN 'Top 50%'
            WHEN CAST(SUBSTRING(esg_risk_percentile, 1, 2) AS INT) < 50 THEN 'Bottom 50%'
        END AS Category
    FROM 
        sp_esg
) subquery
WHERE Category IS NOT NULL
    AND Category = 'Top 10%' OR Category = 'Top 25%' or Category = 'Top 50%'
ORDER BY esg_risk_percentile DESC"

# Execute the query
esg_percentile <- sqldf(query6)

# Format the output using the datatable function
datatable(esg_percentile, options = list(
  scrollX = TRUE,  # horizontal scrolling
  pageLength = 5   # Set the number of rows per page
))

Findings: Companies in the Top 10% have the highest ESG risk scores, while those in the top 50% have lower ESG risk scores. The Top 10% category has overlap with the list of companies created with high controversy scores. This plot can assist companies to set benchmarks for their ESG practices based on their intended ESG Risk Percentiles.

4 Conclusion

This analysis is an exploratory data analysis of ESG Risk data from two different datasets with data visualizations. My SQL queries extract various insights, such as similarities between the datasets, ESG risk scores by company, average ESG risk by industry, companies with high controversy scores, companies with exceptional governance scores, processing dates for large companies, and companies in high ESG risk percentiles. The visualizations provide an interactive and informative way to explore the data and derive meaningful insights.

Future steps include:

  1. Expanding the analysis to include time-series trends.
  2. Comparing ESG scores with financial performance metrics.
  3. Developing predictive models for ESG risk assessment.